Even with the introduction of the dplyr
package, it is often quicker to transform data using functions in base R. In this document, we will demonstrate how some of the functionality of dplyr
can be replicated in base R. We’ll use the built-in mtcars
datset as a running example.
library(dplyr)
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
select
In dplyr
, the select
function is used to choose the columns we want to keep. This functionality can be replicated easily in base R. Recall that a data frame can be viewed as a matrix as well; hence, we can pick out the columns we want by using square brackets and inserting a vector after the comma within the square brackets. For example, we we want the mpg
and drat
columns, we could do
df <- mtcars[, c("mpg", "drat")]
head(df)
## mpg drat
## Mazda RX4 21.0 3.90
## Mazda RX4 Wag 21.0 3.90
## Datsun 710 22.8 3.85
## Hornet 4 Drive 21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant 18.1 2.76
If we knew the column numbers, we could simply specify that:
df <- mtcars[, c(1, 5)]
head(df)
## mpg drat
## Mazda RX4 21.0 3.90
## Mazda RX4 Wag 21.0 3.90
## Datsun 710 22.8 3.85
## Hornet 4 Drive 21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant 18.1 2.76
To select all columns except those specified, add a minus sign in front of the vector:
df <- mtcars[, -c(1, 5)]
head(df)
## cyl disp hp wt qsec vs am gear carb
## Mazda RX4 6 160 110 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 6 160 110 2.875 17.02 0 1 4 4
## Datsun 710 4 108 93 2.320 18.61 1 1 4 1
## Hornet 4 Drive 6 258 110 3.215 19.44 1 0 3 1
## Hornet Sportabout 8 360 175 3.440 17.02 0 0 3 2
## Valiant 6 225 105 3.460 20.22 1 0 3 1
mutate
To replicate mutate
’s functionality, recall that under the hood, a data frame is a list, with the column names being keys and the values in the column being the values. Hence, we can use list notation to specify a new column.
For example, if we wanted to create a new column kml
to denote kilometers per hour (instead of mpg
miles per gallon), we could do
mtcars$kml <- mtcars$mpg * 1.609 / 3.785
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## kml
## Mazda RX4 8.927081
## Mazda RX4 Wag 8.927081
## Datsun 710 9.692259
## Hornet 4 Drive 9.097120
## Hornet Sportabout 7.949353
## Valiant 7.694293
Note that we have to type mtcars$mpg
on the right instead of just mpg
. That is because mpg
in itself is not an object in our global environment.
We can use the same syntax to overwrite an existing column. For example, the original wt
column gives the weight of the car in 1000 lbs. The code below changes the column to give the weight of the car in kilograms:
mtcars$wt <- mtcars$wt * 1000 / 2.2046
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 1188.424 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 1304.091 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 1052.345 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 1458.314 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 1560.374 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 1569.446 20.22 1 0 3 1
## kml
## Mazda RX4 8.927081
## Mazda RX4 Wag 8.927081
## Datsun 710 9.692259
## Hornet 4 Drive 9.097120
## Hornet Sportabout 7.949353
## Valiant 7.694293
If we want to define a new column based on more than one of the original columns, we can do so but have to keep typing out the name of the data frame:
mtcars$nonsense <- mtcars$vs + mtcars$am + mtcars$gear
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 1188.424 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 1304.091 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 1052.345 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 1458.314 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 1560.374 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 1569.446 20.22 1 0 3 1
## kml nonsense
## Mazda RX4 8.927081 5
## Mazda RX4 Wag 8.927081 5
## Datsun 710 9.692259 6
## Hornet 4 Drive 9.097120 4
## Hornet Sportabout 7.949353 3
## Valiant 7.694293 4
For more succinct code, we can use the with
function: it allows us to type just the column names.
mtcars$nonsense <- with(mtcars, vs + am + gear)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 1188.424 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 1304.091 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 1052.345 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 1458.314 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 1560.374 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 1569.446 20.22 1 0 3 1
## kml nonsense
## Mazda RX4 8.927081 5
## Mazda RX4 Wag 8.927081 5
## Datsun 710 9.692259 6
## Hornet 4 Drive 9.097120 4
## Hornet Sportabout 7.949353 3
## Valiant 7.694293 4
arrange
First, let’s restore the original mtcars
dataset:
data(mtcars)
Arranging the rows in base R is a little complicated. Let’s say we want to order the dataset according to mpg
in ascending order. We first call the order
function on mtcars$mpg
, which returns a vector telling us which row has the smallest mpg, then the second smallest mpg, and so on. From the result below, it looks like row 15 has the smallest mpg
value, followed by row 16.
order(mtcars$mpg)
## [1] 15 16 24 7 17 31 14 23 22 29 12 13 11 6 5 10 25 30 1 2 4 32 21
## [24] 3 9 8 27 26 19 28 18 20
To sort the whole dataset according to mpg
, we use square brackets and insert the output of order
before the comma within the square brackets:
mtcars[order(mtcars$mpg), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Sorting in descending order can be done by specifying decreasing = TRUE
in the order
function:
head(mtcars[order(mtcars$mpg, decreasing = TRUE), ])
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
order
can take more than one argument: if two rows tie on the first argument, the second argument is used to break ties, and so on. The code below sorts the rows by cyl
, then vs
, then mpg
:
mtcars[with(mtcars, order(cyl, vs, mpg)), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
filter
If I wanted to look at just the cars with 8 cylinders (i.e. cyl == 8
), I could do it the following way using dplyr
(notice the double equal sign to test for equality; signal equal sign represents assignment instead):
mtcars %>% filter(cyl == 8)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 3 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 4 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 5 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 9 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 10 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 11 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 14 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
To do this in base R, we first need to know another way of extracting elements from a vector. Consider the vector below:
vec <- 1:3
To extract a group of elements from vec
, we previously used square bracket notation, with a vector of indices that we wanted to extract:
vec[c(1,2)]
## [1] 1 2
Another way to extract elements is by putting a logical vector of the same length in the square brackets. R will then extract those elements which match to TRUE
. For example, the code below extracts the first and third elements:
vec[c(TRUE, FALSE, TRUE)]
## [1] 1 3
To extract all the observations with exactly 8 cylinders, we can do this:
df <- mtcars[mtcars$cyl == 8, ]
table(df$cyl)
##
## 8
## 14
To extract observations with less than 8 cylinders:
df <- mtcars[mtcars$cyl < 8, ]
table(df$cyl)
##
## 4 6
## 11 7
To extract observations such that the number of cylinders is not 8:
df <- mtcars[mtcars$cyl != 8, ]
table(df$cyl)
##
## 4 6
## 11 7
summarize
and group_by
If we wanted the mean mpg
for the table, we would do the following in dplyr
:
mtcars %>% summarize(mean_mpg = mean(mpg))
## mean_mpg
## 1 20.09062
In the above, mean_mpg
is the name of the new column, while mean(mpg)
is the function to be applied to the data frame. In base R, we would just apply the function directly:
mean(mtcars$mpg)
## [1] 20.09062
dplyr
’s summarize
function has the power to perform multiple summaries at the same time and have the result be a data frame as well. There is no base R equivalent for this functionality.
dplyr
’s summarize
function also works really well with group_by
to give summaries for subgroups. For example, it is easy to get the mean mpg
value for cars of different cyl
values:
mtcars %>% group_by(cyl) %>%
summarize(mean_mpg = mean(mpg))
## # A tibble: 3 x 2
## cyl mean_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
It turns out that base R can do this too, but such code is rare nowadays:
by(mtcars, mtcars$cyl, function(x) mean(x$mpg))
## mtcars$cyl: 4
## [1] 26.66364
## --------------------------------------------------------
## mtcars$cyl: 6
## [1] 19.74286
## --------------------------------------------------------
## mtcars$cyl: 8
## [1] 15.1